Characterizing viral samples using machine learning for Raman and absorption spectroscopy

Abstract Machine learning methods can provide valuable information for analyzing biological samples in the pharmaceutical industry, such as predicting the concentration of viral particles of interest. Here, we utilized both convolutional neural networks (CNNs) and random forests (RFs) to predict the concentration of samples containing measles, mumps, rubella, and varicella‐zoster viruses (ProQuad®) based on Raman and absorption spectroscopy. We prepared Raman and absorption spectra data sets with known concentration values, then used the Raman and absorption signals individually and together to train RFs and CNNs. We demonstrated that both RFs and CNNs can make predictions with R² values as high as 95%. We proposed two different networks that jointly use the Raman and absorption spectra, and our results demonstrated that concatenating the Raman and absorption data increases the prediction accuracy compared to using either the Raman or the absorption spectrum alone. Additionally, we verified the advantage of using the joint Raman‐absorption data with principal component analysis. Furthermore, our method can be extended to characterize properties other than concentration, such as the type of viral particles.

effective purification, polishing steps, and formulation with stable storage conditions. These processes require comprehensive and continuous quality management to maintain the product's efficacy and ensure public safety. With the advancement in viral vector-driven gene therapies and vaccine production, there is a growing interest in improving the continuous production of virus-like particle (VLP)-based vaccines (Gutierrez-Granados et al., 2018). The development of continuous manufacturing processes in the vaccine industry demands rapid, robust, and continuous analytical methods (process analytical technology [PAT] tools) to understand manufacturing processes in real time (Maruthamuthu, Rudge, et al., 2020).
Noninvasive in-line sensors such as Raman probes (Raman spectroscopy) hold great potential due to their high sensitivity in reading the molecular fingerprints of chemical and biological molecules, species, or products (Butler et al., 2016; Rolinger et al., 2020). Raman spectra possess clear spectral features that can be easily assigned to different chemical compounds. Additionally, minimal sample preparation is sufficient for making accurate quantitative predictions using Raman spectra (Pian et al., 2022). Moreover, Raman spectroscopy provides valuable information on various analyte molecules even at ultra-low concentrations (Panneerselvam et al., 2022).
Similarly, absorption spectroscopy is a robust technique that, owing to its high sensitivity and large signal-to-noise ratio (Torrisi et al., 2020), has the potential to serve as an effective tool for making predictions. Generally, both Raman and absorption spectra have been widely used for particle detection and identification (Barnes et al., 2006; Nitkowski et al., 2008; Pallaoro et al., 2015; Probst et al., 2021) and quantitative analysis (Bao et al., 2018; Storey & Helmy, 2019; Strachan et al., 2007).
Recently, machine learning (ML) has become popular for making predictions based on spectroscopy data. Both supervised and unsupervised ML techniques have been applied to Raman signals to make predictions (Ralbovsky & Lednev, 2020). In particular, Raman spectroscopy has been utilized for cancer predictions (Ralbovsky & Lednev, 2020). For instance, techniques such as principal component analysis or artificial neural networks have been utilized for detecting cervical cancer (Daniel et al., 2018). Furthermore, Raman signals have been utilized for classification problems, such as classifying bacterial (Khan et al., 2018; Koya et al., 2018; Maruthamuthu, Raffiee, et al., 2020; Maruthamuthu, Rudge, et al., 2020), viral (Ditta et al., 2019; Tong et al., 2019), and fungal infections (Dzurendová et al., 2021; Guo et al., 2021). Additionally, Raman spectroscopy has been applied for regression purposes, such as predicting the concentration of markers of interest, for example, sensing the pH and lactate in body fluids (Olaetxea et al., 2020). Absorption spectroscopy has also been utilized for the characterization of proteins (Zhang et al., 2021), the classification of wines (Philippidis et al., 2020), and quantifying the concentration of organic acids (Wolf et al., 2013). Furthermore, joint Raman and absorption spectra have been applied to predict concentration values (Isaev et al., 2020).
Previous studies have confirmed the capability of ML techniques in making quantitative predictions based on Raman or absorption signals. However, a comparison of these signals and their strength in making accurate ML-based predictions for viral samples, such as MMRV, has not been reported before. Here, we aim to create methods based on Raman and absorption spectroscopy that enable monitoring of the concentration of viral particles in well plates.
Additionally, it is not known whether using Raman and absorption spectra simultaneously can boost the prediction accuracy compared to using only the Raman or the absorption spectrum. In our previous study, we demonstrated that deep learning enables the efficient detection of bacteria, fungi, and mammalian cells in static dried-down conditions (Maruthamuthu, Raffiee, et al., 2020). Following our previous study, we intend to build convolutional neural network (CNN) and random forest (RF) models that accept the Raman or absorption spectra, or their combination, as the input and predict the concentration of samples containing MMRV.

| Data acquisition
All the samples prepared in this study are based on ProQuad®, which is a sterile, lyophilized, preservative-free, live virus vaccine that contains measles, mumps, rubella, and varicella-zoster viruses (Kuter et al., 2006). We procured ProQuad® (manufactured by Merck & Co., Inc.) from the Purdue College of Pharmacy and stored it at −20°C.
We prepared linear dilutions of the ProQuad® vaccine with a step size of 4% and an initial concentration of 7.20 × 10⁵ plaque-forming units/ml (PFU/ml) (lyophilized ProQuad® + 10 µl diluent). Throughout this article, we refer to the number of infective particles within the sample (PFU) as particles. All the Raman spectra of the ProQuad® dilutions were collected with the Renishaw inVia™ Qontor confocal Raman microscope (Renishaw plc) (RENISHAW). We used a 785-nm excitation laser with 100% (300 mW) power and 10 s acquisition time (1 accumulation). The spectral resolution was 1 cm⁻¹, and the spectrum ranged from 101 to 3200 cm⁻¹, corresponding to 3194 Raman shifts. The samples were focused with a 5× objective of a microscope (Leica DM2700M), and three replicate Raman spectra were collected for each dilution. The sample volume used for the measurement was 100 µl, and the substrate used for the measurements was a 96-well plate (Corning™ 3635 UV-Transparent Microplates). The experiment was repeated once. The raw Raman spectral data was collected using WiRE 5.5 software.
Furthermore, we collected the absorption spectrum for the ProQuad® dilutions using the BMG LABTECH, Inc. microplate reader (CLARIOstar Plus, SN: 430-2173). The spectrum ranged from 220 to 1000 nm with a spectral resolution of 1 nm, corresponding to 781 wavelengths. The sample volume used for the measurement was 100 µl, and the substrate used for the measurements was a 96-well plate (Corning™ 3635 UV-Transparent Microplates). We collected three spectral scans for each dilution. The experiment was repeated once.
In total, the data set includes Raman and absorption spectra for 25 different concentration values with 3 to 6 replicates for each value, making a total of 116 samples, where 20% of this data is used for testing by 5-fold cross-validation as described in Section 2.2.
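The arithmetic behind the dilution series and the absorption grid can be checked with a short sketch. It assumes that the 4% steps run from 100% of the initial concentration down to 4%. The text does not state the end points, but this reading yields the reported 25 concentration values. The variable names are illustrative only.

```python
# Initial concentration and dilution step as stated in the text.
INITIAL_PFU_PER_ML = 7.20e5
STEP_PERCENT = 4

# ASSUMPTION: dilution levels run 100%, 96%, ..., 4% of the initial concentration.
levels_percent = list(range(100, 0, -STEP_PERCENT))
concentrations = [INITIAL_PFU_PER_ML * p / 100 for p in levels_percent]

# Absorption grid: 220 to 1000 nm inclusive, in 1 nm steps.
n_wavelengths = len(range(220, 1001))

print(len(concentrations))   # number of concentration levels
print(concentrations[0])     # highest concentration in PFU/ml
print(concentrations[-1])    # lowest concentration in PFU/ml
print(n_wavelengths)         # number of absorption wavelengths
```

Under this assumption the series has 25 levels, and the 1 nm grid reproduces the 781 wavelengths reported for the absorption spectra.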

| Machine learning modeling
We adopt two widely used ML techniques to relate the Raman and absorption spectra to the concentration values: the RF and the CNN techniques. Before training, to ensure the reproducibility of the results, the random seed is set to zero for all the models. To assess the accuracy of predictions, we use the coefficient of determination (R² score). Further, the 5-fold cross-validation technique is used to train both the CNNs and the RFs. In this method, the whole data set is split into five sections and the model is trained five times, each time using four sections as the training data set and the remaining section as the testing data set. The 5-fold cross-validation ensures that every data point falls into the testing data set once, which reduces the bias in the estimated prediction accuracy. The scikit-learn (Pedregosa et al., 2011) and PyTorch (Paszke et al., 2019) modules in Python are used for modeling the RFs and CNNs, respectively.
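The splitting scheme and the R² score described above can be sketched with NumPy alone. This is a minimal illustration, not the study's code: whether the samples were shuffled before splitting is an assumption, and the helper names `five_fold_indices` and `r2_score` are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)        # seed of zero, as in the text
    shuffled = rng.permutation(n_samples)    # ASSUMPTION: shuffled split
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# 116 samples, as in the data set described in the text.
splits = list(five_fold_indices(116))
test_counts = np.zeros(116, dtype=int)
for train_idx, test_idx in splits:
    test_counts[test_idx] += 1

print([len(test_idx) for _, test_idx in splits])  # test-fold sizes
print(bool(np.all(test_counts == 1)))             # every sample tested once
```

With 116 samples, each test fold holds 23 or 24 spectra, which corresponds to the roughly 20% test fraction per fold.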
A CNN is a supervised machine learning technique that, in our case, takes one-dimensional signals as the input and identifies the important parts of the signal, which paves the way for automatic learning of the features and hidden aspects of the signal that are important for the regression. In other words, a CNN can capture the dependencies between different parts of the Raman or absorption spectrum. The general architectures of the deep learning models used in this study are similar: when either the Raman or the absorption spectrum is used as the input, we use a single feed-forward CNN consisting of four convolutional layers followed by four fully connected layers, as shown in Figure 1a. When both the Raman and absorption spectra are used as the input, we use two different designs. In the first design, we concatenate the Raman and absorption signals and feed them into a single CNN, as shown in Figure 1a. In the second design, a double CNN is created, as demonstrated in Figure 1b. In the double CNN, the Raman and absorption spectra are first fed into two separate networks, each with four convolutional layers followed by two fully connected layers. The outputs of the two networks are then concatenated and fed into a network with two fully connected layers. In all models, the architecture used for the convolutional layers is based on residual mapping, following the deep residual learning method (He et al., 2016). The presence of residual blocks with shortcut connections between inputs and outputs boosts the training stability and paves the way for deeper networks (He et al., 2016).

FIGURE 1 Schematic view of the neural network structure when (a) the Raman, absorption, or concatenated Raman-absorption spectrum is used as the input and (b) both the Raman and absorption spectra are used as separate inputs. The number of layers shown here is for illustration purposes and does not reflect the actual values.

Furthermore, the kernel size used for all the convolutional layers is three, with zero padding and a stride of one. Additionally, all the networks are trained for 6000 epochs (iterations), beyond which a further increase in the epochs does not significantly boost the prediction accuracy. We use the mean squared error loss function as the training criterion with back-propagation, and we adopt Adam, a stochastic gradient-based optimizer with momentum and an adaptive learning rate (Kingma & Ba, 2014), where the weight decay and learning rate are set to 0.1 and 10⁻⁸, respectively. Batch normalization and ReLU activation functions are applied consecutively at the end of each convolutional layer, and the ReLU function is applied at the end of each fully connected layer. After passing the last ReLU function, the data is mapped into one neuron as the output. The numbers of channels and neurons are hyperparameters that can be tuned for further accuracy. In the current study, we found that a maximum of 10 channels and 4000 neurons leads to sufficient accuracy while avoiding over-fitting.
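How the stated kernel size, padding, and stride affect the signal length can be sketched with the standard output-length formula for a one-dimensional convolution. The sketch assumes that "zero padding" means a padding of 0 and that no pooling is applied between the four convolutional layers. Neither detail is confirmed by the text, and the function names are illustrative.

```python
def conv1d_output_length(length, kernel_size=3, padding=0, stride=1):
    """Standard output length of a 1D convolution (dilation of 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def length_after_layers(length, n_layers=4, **conv_kwargs):
    """Apply the same convolution n_layers times (ASSUMPTION: no pooling)."""
    for _ in range(n_layers):
        length = conv1d_output_length(length, **conv_kwargs)
    return length

# Input lengths from the data acquisition section.
N_RAMAN = 3194
N_ABSORPTION = 781

for name, n in [("Raman", N_RAMAN),
                ("absorption", N_ABSORPTION),
                ("concatenated", N_RAMAN + N_ABSORPTION)]:
    print(name, n, "->", length_after_layers(n))
```

Under these assumptions each layer shortens the signal by two points, so four layers remove eight points in total. If the padding were one instead, the length would be preserved.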
RF regression is a supervised machine learning technique that utilizes the ensemble average of multiple decision trees to make final predictions (Grömping, 2009). Each tree makes its own prediction of the concentration. As shown in Figure 2, the Raman, absorption, or concatenated spectrum is used as the input, with the concentration as the output. RF is a powerful regression technique that runs efficiently on larger data sets. RFs are generally suitable for making predictions within the training range. Additionally, we use the bootstrapping technique, where multiple resampled training sets are drawn from the original training set and each decision tree is trained on a different resampled set. Bootstrapping reduces the chance of over-fitting and stabilizes the model. The squared error criterion in scikit-learn (Pedregosa et al., 2011) is used to measure the quality of splitting for 100 trees.

| RESULTS AND DISCUSSIONS
We use CNN and RF as two powerful ML techniques, with different levels of preprocessing to identify the optimum predictions. Here, we discuss how the algorithms work with the test data generated using 5-fold cross-validation, where each fold can contain points both inside and outside of the training ranges. For CNN, we discuss whether a single or double CNN works better when both Raman and absorption spectra are used as the input.
FIGURE 2 Schematic view of the random forest model composed of multiple decision trees with either the Raman, absorption, or concatenated Raman-absorption spectrum as the input. "A" stands for the average. The number of nodes and trees shown are for illustration purposes and do not reflect the actual values.

FIGURE 3 Raw and preprocessed Raman and absorption plots at two different concentrations.

In this study, the CNN models are composed of multiple convolutional layers with a kernel size of three, where, in each layer, hidden features and patterns are learned by convolving over the signal. To expedite the learning process and improve the model performance, it is beneficial to preprocess the data before training the models. Thus, we apply baseline corrections and normalize the data using the standard normal variate method, that is, subtracting the mean value from each spectrum and dividing by its standard deviation, as described by Romero-Torres et al. (2006). Figure 3 demonstrates the Raman and absorption spectra before and after preprocessing for two different concentrations. In addition to normalization and applying filters, some studies trim the Raman spectrum to obtain the spectral range of interest (Pian et al., 2022). In the current study, we did not observe any significant gain in the prediction accuracy when the Raman or absorption spectrum is trimmed, as shown, for example, for the RF method in the Appendix. Additionally, we analyzed how the predictions change when the control spectrum of the solvent is subtracted, as described in the Appendix, and noticed a reduction in accuracy. Therefore, we excluded the subtraction of the control spectrum from the preprocessing steps.
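The standard normal variate normalization described in this section can be sketched with NumPy. The baseline correction is omitted because its method is not specified in the text. The use of the sample standard deviation (`ddof=1`) is an assumption, and the input spectra are random placeholders.

```python
import numpy as np

def standard_normal_variate(spectra):
    """Center each spectrum (row) on its mean and scale by its standard deviation."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    mean = spectra.mean(axis=1, keepdims=True)
    # ASSUMPTION: sample standard deviation (ddof=1).
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Placeholder spectra: 3 replicates on a 781-point absorption grid.
rng = np.random.default_rng(0)
raw = 5.0 + 2.0 * rng.random((3, 781))
snv = standard_normal_variate(raw)

print(np.allclose(snv.mean(axis=1), 0.0))          # each row has zero mean
print(np.allclose(snv.std(axis=1, ddof=1), 1.0))   # and unit standard deviation
```

Since the normalization acts on each spectrum separately, a constant offset or a multiplicative scale applied to a spectrum is removed, which makes replicates measured at different overall intensities comparable.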

| CONCLUSION
In the current study, the possibility of using absorption, Raman, and joint Raman-absorption spectrum to determine the concentration of the samples containing viral particles was investigated. RF and CNN, as two different machine learning algorithms, were utilized for making predictions, and the prediction accuracy was monitored using 5-fold cross-validation. We demonstrated that with sufficient preprocessing, both the Raman and absorption spectra could be used to create a surrogate to predict the values of concentration. In most cases, the Raman spectrum leads to more accurate predictions

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
The data sets generated and/or analyzed during the current study are

Note: The background noise can affect the Raman and absorption spectra, particularly at low Raman shifts and wavelengths. As a result, in this section, we remove the initial parts of the Raman (Raman shift < 300 cm⁻¹) and absorption (λ < 250 nm) spectra. As shown in Table A3, the prediction accuracies do not change significantly with trimming. Therefore, we used the entire spectra for prediction. Indeed, one of the advantages of using machine learning techniques is that they automatically detect which parts of the signal are important. Figure A1 demonstrates the importance values for the Raman and absorption spectra before and after trimming. The importance values are obtained from the feature importance attribute of the scikit-learn RF model (Pedregosa et al., 2011). As evident, we do not notice any significant shift in the important regions of the signals.
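The trimming step described in this note can be sketched with a boolean mask over the spectral axis. The absorption grid follows the stated 220 to 1000 nm range in 1 nm steps. The Raman axis is a hypothetical evenly spaced grid with the reported number of points, and the intensities are placeholders.

```python
import numpy as np

def trim_spectrum(axis, intensities, lower_bound):
    """Keep only the points whose axis value is at or above lower_bound."""
    axis = np.asarray(axis)
    intensities = np.asarray(intensities)
    keep = axis >= lower_bound
    return axis[keep], intensities[..., keep]

# Absorption: 220 to 1000 nm in 1 nm steps (781 points), trimmed below 250 nm.
wavelengths = np.arange(220, 1001)
absorbance = np.zeros((3, wavelengths.size))   # placeholder intensities
trimmed_wl, trimmed_abs = trim_spectrum(wavelengths, absorbance, 250)

# Raman: HYPOTHETICAL evenly spaced axis, trimmed below 300 cm^-1.
raman_shifts = np.linspace(101, 3200, 3194)
raman_counts = np.zeros((3, raman_shifts.size))  # placeholder intensities
trimmed_shift, trimmed_raman = trim_spectrum(raman_shifts, raman_counts, 300)

print(trimmed_wl.size, trimmed_abs.shape)
print(trimmed_shift.min() >= 300)
```

The mask is applied along the last axis, so the same call works for a single spectrum or for a stack of replicates.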
TABLE A3 The R² values of 5-fold cross-validation for the prediction of concentration for the trimmed Raman, absorption, and concatenated Raman-absorption spectrum using the RF method

Note: In this study, we did not subtract the control spectrum of the solvent (sterile water) from the Raman and absorption spectra to minimize the amount of preprocessing. Here, we demonstrate how the prediction accuracies change if we subtract the control data from all the spectra. Figure A2 compares the Raman and absorption spectra of samples that contain viral particles with the control spectra. As evident, the presence of viral particles induces noticeable changes at most Raman shifts. Further, the absorption signal at all wavelengths is different when viral particles are introduced. Additionally, we present the spectra with the water data subtracted. Table A4 demonstrates the R² values for predictions of the RF method using the spectra with the water data subtracted. We note that the R² values decrease with the subtraction of the water data compared to the values presented in Table A3. Therefore, we excluded the subtraction of the water spectrum from the preprocessing.

TABLE A4 The R² values of 5-fold cross-validation for the prediction of concentration for the trimmed Raman, absorption, and concatenated Raman-absorption spectrum using the RF method with the control data subtracted

FIGURE A2 Preprocessed Raman, absorption, and control (sterile water) spectrum plots, in addition to plots with the control data subtracted.